Abstract

We are trying to find paraphrases from Japanese news articles which
can be used for Information Extraction. We focused on the fact that a
single event can be reported in more than one article in different
ways. However, certain kinds of noun phrases such as names, dates and
numbers behave as “anchors” which are unlikely to change across
articles. Our key idea is to identify these anchors among comparable
articles and extract portions of expressions which share the anchors.
This way we can extract expressions which convey the same information.
Obtained paraphrases are generalized as templates and stored for
future use.

In this paper, first we describe our basic idea of paraphrase
acquisition. Our method is divided into roughly four steps, each of
which is explained in turn. Then we illustrate several issues which we
encounter in real texts. To solve these problems, we introduce two
techniques: coreference resolution and structural restriction of
possible portions of expressions. Finally we discuss the experimental
results and conclusions.

1  Introduction 

We are trying to obtain paraphrases which can be used for Information
Extraction (IE). IE systems scan articles and retrieve specific
information which is required for a certain domain defined in advance.
Currently, many IE tasks are performed by pattern matching. For
example, if the system receives a sentence “Two more people have died
in Hong Kong from SARS,” and the system has a pattern “NUMBER people
die in LOCATION” in its inventory, then the system can apply the
pattern to the sentence and fill the slots, and obtain information
such as “NUMBER = two more, LOCATION = Hong Kong”. In most IE systems,
the performance of the system is dependent on these well-designed
patterns.
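The slot-filling step in the example above can be sketched as follows. This is only an illustration under simplifying assumptions: the input sentence is assumed to be already lemmatized ("have died" reduced to "die"), and the names `SLOT_REGEX`, `compile_pattern` and `fill_slots`, as well as the toy regular expressions for each slot, are hypothetical and not part of any IE system described here.

```python
import re

# Hypothetical slot vocabulary: each slot name maps to a toy regex
# describing what may fill it. A real system would use a Named Entity tagger.
SLOT_REGEX = {
    "NUMBER": r"(?P<NUMBER>(?:\d+|one|two|three)(?: more)?)",
    "LOCATION": r"(?P<LOCATION>[A-Z][a-z]+(?: [A-Z][a-z]+)*)",
}


def compile_pattern(pattern):
    """Turn a surface pattern such as 'NUMBER people die in LOCATION' into a regex."""
    parts = []
    for token in pattern.split():
        # Slot names become named groups; other tokens must match literally.
        parts.append(SLOT_REGEX.get(token, re.escape(token)))
    return re.compile(r"\s+".join(parts))


def fill_slots(pattern, lemmatized_sentence):
    """Return a dict of slot fillers, or None if the pattern does not apply."""
    match = compile_pattern(pattern).search(lemmatized_sentence)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sentence = "two more people die in Hong Kong from SARS"
    print(fill_slots("NUMBER people die in LOCATION", sentence))
```

The same pattern does not apply to a differently worded sentence such as "Hong Kong reports two more deaths", which is why equivalent patterns have to be collected.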

In natural language sentences, a single event can be expressed in many
different ways. So we need to prepare patterns for various kinds of
expressions used in articles. We are interested in clustering IE
patterns which capture the same information. For example, a pattern
such as “LOCATION reports NUMBER deaths” can be used for the same
purpose as the previous one, since this pattern could also capture the
casualties occurring in a certain location. Prior work to relate two
IE patterns was reported by (Shinyama et al., 2002). However, in this
attempt only limited forms of expressions could be obtained.
Furthermore, the obtained paraphrases were limited to existing IE
patterns only. We are interested in collecting various kinds of clues,
including similar IE patterns themselves, to connect two patterns. In
this paper, we tried to obtain more varied paraphrases. Although our
current method is intended for use in Information Extraction, we think
the same approach can be applied to obtain paraphrases for other
purposes, such as machine translation or text summarization.

There have been several attempts to obtain paraphrases. (Barzilay and
McKeown, 2001) applied text alignment to parallel translations of a
single text and used a part-of-speech tagger to obtain paraphrases.
(Lin and Pantel, 2001) used mutual information of word distribution to
calculate the similarity of expressions. (Pang et al., 2003) also used
text alignment and obtained a finite state automaton which generates
paraphrases. (Ravichandran and Hovy, 2002) used pairs of questions and
answers to obtain varied patterns which give the same answer. Our
approach is different from these works in that we used comparable news
articles as a source of paraphrases and used Named Entity tagging and
dependency analysis to extract corresponding expressions.

2  Overall  Procedure  of  Paraphrase 
Acquisition 

Our main goal is to obtain pattern clusters for IE, which consist of
sets of equivalent patterns capturing the same information. So we
tried to discover paraphrases contained in Japanese news articles for
a specific domain. Our basic idea is to search news articles from the
same day. We focused on the fact that various newspapers describe a
single event in different ways. So if we can discover an event which
is reported in more than one newspaper, we can hope these articles can
be used as the source of paraphrases. For example, the following
articles appeared in “Health” sections in different newspapers on
Apr. 11:

1.  “The government has announced that two more people have died in
Hong Kong after contracting the SARS virus and 61 new cases of the
illness have been detected.” (Reuters, Apr. 11)

2.  “Hong Kong reported two more deaths and 61 fresh cases of SARS
Friday as governments across the world took tough steps to stop the
killer virus at their borders.” (Channel News Asia, Apr. 11)

In these articles, we can find several corresponding parts, such as
“NUMBER people have died in LOCATION” and “LOCATION reported NUMBER
deaths”. Although their syntactic structures are different, they still
convey the same single fact. Here it is worth noting that even if a
different expression is used, some noun phrases such as “Hong Kong” or
“two more” are preserved across the two articles. We found that these
words shared by the two sentences provide firm anchors for two
different expressions. In particular, Named Entities (NEs) such as
names, locations, dates or numbers can be the firmest anchors since
they are indispensable to report an event and difficult to paraphrase.

We tried to obtain paraphrases by using this property. First we
collect a set of comparable articles which report the same event, and
pull appropriate portions out of the sentences which share the same
anchors. If we carefully choose appropriate portions of the sentences,
the extracted expressions will convey the same information; i.e. they
are paraphrases. After corresponding portions are obtained, we
generalize the expressions to templates of paraphrases which can be
used in the future.

Our method is divided into four steps:

1.  Find comparable sentences which report the same event from
different newspapers.

2.  Identify anchors in the comparable sentences.

3.  Extract corresponding portions from the sentences.

4.  Generalize the obtained expressions to paraphrase templates.

Figure 1 shows the overall procedure. In the remainder of this
section, we describe each step in turn.

2.1  Find  Comparable  Sentences 

To find comparable articles and sentences, we used methods developed
for Topic Detection and Tracking (Wayne, 1998). The actual process is
divided into two parts: article level matching and sentence level
matching. Currently we assume that a pair of paraphrases can be found
in a single sentence of each article and corresponding expressions
don’t range across two or more sentences. Article level matching is
first required to narrow the search space and reduce erroneous
matching of anchors.


[Figure: articles on the same day; find comparable articles and
sentences; identify anchors; extract corresponding portions;
generalize expressions, e.g. "LOCATION reports NUMBER deaths"]

Figure 1: The overall procedure


Before applying this technique, we first preprocessed the articles by
stripping off the strings which are not considered as sentences. Then
we used a part-of-speech tagger to obtain segmented words. In the
actual matching process we used a method described in (Papka et al.,
1999) to find a set of comparable articles. Then we used a simple
vector space model for sentence matching.
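As a rough illustration of sentence level matching with a vector space model, the sketch below scores every sentence pair from two articles by the cosine similarity of their bag-of-words vectors. It assumes the sentences are already segmented into word lists; the function names and the `threshold` value of 0.5 are hypothetical choices for this example, and no term weighting (such as TF-IDF) is applied.

```python
import math
from collections import Counter


def cosine(words_a, words_b):
    """Cosine similarity between two bag-of-words vectors (raw counts)."""
    vec_a, vec_b = Counter(words_a), Counter(words_b)
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_sentences(article_a, article_b, threshold=0.5):
    """Return (similarity, index_a, index_b) for pairs at or above the threshold.

    The threshold is an arbitrary value chosen for illustration only.
    """
    pairs = []
    for i, sent_a in enumerate(article_a):
        for j, sent_b in enumerate(article_b):
            similarity = cosine(sent_a, sent_b)
            if similarity >= threshold:
                pairs.append((similarity, i, j))
    return sorted(pairs, reverse=True)


if __name__ == "__main__":
    article_a = [["hong", "kong", "two", "deaths"], ["weather", "sunny"]]
    article_b = [["two", "people", "died", "hong", "kong"], ["stocks", "rose"]]
    print(match_sentences(article_a, article_b))
```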

2.2  Identify  Anchors 

Before extracting paraphrases, we find anchors in comparable
sentences. We used Extended Named Entity tagging to identify anchors.
A Named Entity tagger identifies proper expressions such as names,
locations and dates in sentences. In addition to these expressions, an
Extended Named Entity tagger identifies some common nouns such as
disease names or numbers, that are also unlikely to change (Sekine et
al., 2002). For each corresponding pair of sentences, we apply the
tagger and identify the same noun phrases which appear in both
sentences as anchors.
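Anchor identification then reduces to intersecting the tagged noun phrases of the two sentences. The sketch below assumes a hypothetical representation in which each sentence is a list of `(text, tag)` pairs produced by some tagger, with `tag` set to `None` for chunks that are not entities; it does not reproduce the output format of any particular tagger.

```python
def find_anchors(tagged_a, tagged_b):
    """Return the (text, tag) pairs that occur as entities in both sentences."""
    entities_a = {(text, tag) for text, tag in tagged_a if tag is not None}
    entities_b = {(text, tag) for text, tag in tagged_b if tag is not None}
    return entities_a & entities_b


if __name__ == "__main__":
    sentence_1 = [
        ("two more", "NUMBER"), ("people", None), ("have died", None),
        ("in", None), ("Hong Kong", "LOCATION"),
    ]
    sentence_2 = [
        ("Hong Kong", "LOCATION"), ("reported", None),
        ("two more", "NUMBER"), ("deaths", None),
    ]
    print(find_anchors(sentence_1, sentence_2))
```

Exact string identity is the simplest possible matching criterion; entities that appear under different surface forms would be missed by this sketch.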

2.3  Extract  Corresponding  Sentence  Portions 

Now we identify appropriate boundaries of expressions which share the
anchors identified in the previous stage. To avoid extracting
non-grammatical expressions, we operate on syntactically structured
text rather than sequences of words. Dependency analysis is suitable
for this purpose, since using dependency trees we can reconstruct
grammatically correct expressions from a spanning subtree whose root
is a predicate. Dependency analysis also allows us to extract
expressions which are subtrees but do not correspond to a single
contiguous sequence of words.

We applied a dependency analyzer to a pair of corresponding sentences
and obtained tree structures for each sentence. Each node of the tree
is either a predicate such as a verb or an adjective, or an argument
such as a noun or a pronoun. Each predicate can take one or more
arguments. We generated all possible combinations of subtrees from
each dependency tree, and compared the anchors which are included in
both subtrees. After a pair of corresponding subtrees which share the
anchors is found, the subtree pair can be recognized as paraphrases.
In actual experiments, we put some restrictions on these subtrees,
which will be discussed later. This way we can obtain grammatically
well-formed portions of sentences (Figure 2).
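A minimal sketch of the unrestricted subtree pairing is given below. It makes several simplifying assumptions that are not taken from the method itself: each tree has a single predicate with a flat list of arguments, each argument is a dictionary with hypothetical keys `role`, `text` and `anchor`, and two subtrees are paired only when their non-empty anchor sets are identical.

```python
from itertools import combinations


def subtrees(predicate, arguments):
    """Yield (predicate, chosen_arguments) for every non-empty subset of arguments."""
    for size in range(1, len(arguments) + 1):
        for chosen in combinations(arguments, size):
            yield predicate, chosen


def anchors_of(chosen):
    """Collect the anchors carried by the chosen arguments."""
    return frozenset(arg["anchor"] for arg in chosen if arg["anchor"] is not None)


def candidate_pairs(tree_a, tree_b):
    """Pair subtrees of two trees whose non-empty anchor sets are identical."""
    pairs = []
    for pred_a, args_a in subtrees(*tree_a):
        shared = anchors_of(args_a)
        if not shared:
            continue
        for pred_b, args_b in subtrees(*tree_b):
            if shared == anchors_of(args_b):
                pairs.append(((pred_a, args_a), (pred_b, args_b)))
    return pairs


if __name__ == "__main__":
    tree_a = ("die", [
        {"role": "subject", "text": "two more people", "anchor": "two more"},
        {"role": "in", "text": "Hong Kong", "anchor": "Hong Kong"},
    ])
    tree_b = ("report", [
        {"role": "subject", "text": "Hong Kong", "anchor": "Hong Kong"},
        {"role": "object", "text": "two more deaths", "anchor": "two more"},
    ])
    for left, right in candidate_pairs(tree_a, tree_b):
        print(left[0], [a["text"] for a in left[1]], "<->",
              right[0], [a["text"] for a in right[1]])
```

On this toy input the enumeration produces three candidate pairs, including partial ones built from a single argument, which shows why unrestricted pairing needs further filtering.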

2.4  Generalize  Expressions 

After corresponding portions are obtained, we generalize the
expressions to form usable templates of paraphrases. Actually this is
already done by Extended Named Entity tagging. An Extended Named
Entity tagger classifies proper expressions into several categories.
This is similar to a part-of-speech tagger as it classifies words into
several part-of-speech categories. For example, “Hong Kong” is tagged
as a location name, and “two more” as a number. So an expression such
as “two more people die in Hong Kong” is finally converted into the
form “NUMBER people die in LOCATION” where NUMBER and LOCATION are
slots to fill in. This way we obtain expressions which can be used as
IE patterns.
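The generalization step can be pictured as a string substitution over tagged entities. The helper below is a hypothetical sketch that works on plain strings; a real implementation would substitute on the tagged or parsed structure rather than on raw text.

```python
def generalize(expression, entities):
    """Replace each tagged entity string in the expression with its category name."""
    template = expression
    # Replace longer entity strings first so that an entity contained in
    # another entity does not break the longer one apart.
    for text, tag in sorted(entities, key=lambda e: len(e[0]), reverse=True):
        template = template.replace(text, tag)
    return template


if __name__ == "__main__":
    entities = [("Hong Kong", "LOCATION"), ("two more", "NUMBER")]
    print(generalize("two more people die in Hong Kong", entities))
```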


[Figure: dependency trees with anchors marked; legend: "LOCATION is
included", "NUMBER is included"; argument labels: subject, object]

Figure 2: Extracting portions of sentences


3  Handling  Problems  in  Real  Texts 

In the previous section we described our method for obtaining
paraphrases in principle. However, there are several issues in actual
texts which pose difficulties for our method.

The first one is in finding anchors which refer to the same entity. In
actual articles, names are sometimes referred to in a slightly
different form. For example, “President Bush” can also be referred to
as “Mr. Bush”. Additionally, sometimes the entity is referred to by a
pronoun, such as “he”. Since our method relies on the fact that those
anchors are preserved across articles, anchors which appear in these
varied forms may reduce the actual number of obtained paraphrases.

To handle this problem, we extended the notion of anchors to include
not just Extended Named Entities, but also pronouns and common nouns
such as “the president”. We used a simple coreference resolver after
Extended Named Entity tagging. Currently this is done by simply
assigning the most recent antecedent to pronouns and finding a longest
common subsequence (LCS) between two noun groups. Since it is possible
to form a compound noun such as “President-Bush” in Japanese, we
computed LCS for each character in the two noun groups. We used the
following condition to decide whether two noun groups $s_1$ and $s_2$
are coreferential:

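•  if $2 \le \min(|s_1|, |s_2|) \le |\mathrm{LCS}(s_1, s_2)|$, then
$s_1$ and $s_2$ are considered coreferential.

Here $|s|$ denotes the length of noun group $s$ and
$\mathrm{LCS}(s_1, s_2)$ is the LCS of two noun groups $s_1$ and
$s_2$. Since an LCS can never be longer than the shorter noun group,
the condition holds exactly when the shorter noun group has at least
two characters and is a subsequence of the longer one.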
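The character-level LCS test can be sketched as below. The dynamic-programming routine is the standard textbook LCS length computation, not code from the system described here, and the romanized strings in the usage example are stand-ins for Japanese noun groups.

```python
def lcs_length(s1, s2):
    """Length of the longest common subsequence of two character strings."""
    previous = [0] * (len(s2) + 1)
    for ch1 in s1:
        current = [0]
        for j, ch2 in enumerate(s2, start=1):
            if ch1 == ch2:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def coreferential(s1, s2):
    """Apply the condition 2 <= min(|s1|, |s2|) <= |LCS(s1, s2)|."""
    shorter = min(len(s1), len(s2))
    return 2 <= shorter <= lcs_length(s1, s2)


if __name__ == "__main__":
    # Romanized stand-ins for Japanese compound nouns (illustration only).
    print(coreferential("President-Bush", "Bush"))
    print(coreferential("President-Bush", "Mr-Bush"))
```

Under this condition "Bush" is linked to "President-Bush" because every character of the shorter string appears in order in the longer one, while "Mr-Bush" is not, since its first character has no counterpart.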

The second problem is to extract appropriate portions as paraphrase
expressions. Since we use a tree structure to represent the
expressions, finding common subtrees may take an exponential number of
steps. For example, if a dependency tree in one article has one single
predicate which has $n$ arguments, the number of possible subtrees
which can be obtained from the tree is $2^n$. So the matching process
between arbitrary combinations of subtrees may grow exponentially with
the length of the sentences. Even worse, it can generate many
combinations of sentence portions which don’t make sense as
paraphrases. For example, from the expressions “two more people have
died in Hong Kong” and “Hong Kong reported two more deaths”, we could
extract expressions “in Hong Kong” and “Hong Kong reported”. Although
both of them share one anchor, this is not a correct paraphrase. To
avoid this sort of error, we need to put some additional restrictions
on the expressions.

(Shinyama et al., 2002) used the frequency of expressions to filter
these incorrect pairs of expressions. First the system obtained a set
of IE patterns from corpora (Sudo and Sekine, 2001), and then
calculated the score for each candidate paraphrase by counting how
many times that expression appears as an IE pattern in the whole
corpus. However, with this method, obtainable expressions are limited
to existing IE patterns only. Since we wanted to obtain a broader
range of expressions not limited to IE patterns themselves, we tried
to use other restrictions which can be acquired independently of the
IE system.

We partly solve this problem by calculating the plausibility of each
tree structure. In Japanese sentences, the case of each argument which
modifies a predicate is represented by a case marker (postposition or
joshi) which follows a noun phrase, just like prepositions in English
but in the opposite order. These arguments include subjects and
objects that are elucidated syntactically in English sentences. We
collected frequent cases occurring with a specific predicate in
advance. We applied this restriction when generating subtrees from a
dependency tree by calculating a score for each predicate as follows:

Let an instance of predicate $p$ have cases
$C = \{c_1, c_2, \ldots, c_n\}$ and a function $N_p(I)$ be the number
of instances of $p$ in the corpus whose cases are
$I = \{c_1, c_2, \ldots, c_m\}$. We compute the score $S_p(C)$ of the
instance:

$$S_p(C) = \frac{\sum_{I \subseteq C} N_p(I)}{\text{the number of instances of } p \text{ in the corpus}}$$

Using this metric, a predicate which doesn’t have cases that it should
usually have is given a lower score. A subtree which includes a
predicate whose score is less than a certain threshold is filtered
out. This way we can filter out expressions such as “Hong Kong
reported” in Japanese since it would lack an object case which
normally the verb “report” should have. Moreover, this greatly reduces
the number of possible combinations of subtrees.
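To make the score concrete, the sketch below computes the fraction of corpus instances of a predicate whose case set is contained in the case set $C$ of the candidate instance. The corpus counts, the case markers "ga" (subject) and "wo" (object) used as keys, and the function name `case_score` are hypothetical values chosen for illustration, not figures from the corpus used in this work.

```python
def case_score(case_counts, cases):
    """Score S_p(C): share of corpus instances of p whose case set is a subset of C.

    case_counts maps a frozenset of case markers to the number of corpus
    instances of the predicate observed with exactly that set of cases.
    """
    total = sum(case_counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for observed, n in case_counts.items() if observed <= cases)
    return covered / total


if __name__ == "__main__":
    # Invented counts for a verb like "report": it mostly takes an object.
    report_counts = {
        frozenset({"ga", "wo"}): 80,
        frozenset({"wo"}): 15,
        frozenset({"ga"}): 5,
    }
    print(case_score(report_counts, frozenset({"ga"})))        # subject only
    print(case_score(report_counts, frozenset({"ga", "wo"})))  # subject and object
```

With these invented counts, an instance that has only a subject scores far lower than one that has both subject and object, so a threshold on the score would discard the former.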

4  Experiments 

We used Japanese news articles for this experiment. First we collected
articles for a specific domain from two different newspapers (Mainichi
and Nikkei). Then we used a Japanese part-of-speech tagger (Kurohashi
and Nagao, 1998) and Extended Named Entity tagger to process
documents, and put them into a Topic Detection and Tracking system. In
this experiment, we used a modified version of a Japanese Extended
Named Entity tagger (Uchimoto et al., 2000). This tagger tags person
names, organization names, locations, dates, times and numbers.


Article pairs:
                                 Obtained   Correct
  System                         195        156 (80%)

Sentence pairs (from top 20 article pairs):
                                 Obtained   Correct
  Manual                         93         93
  W/o coref.                     55         41 (75%)
  W coref.                       75         52 (69%)

Paraphrase pairs:
                                 Obtained   Correct
  W/o coref. or restriction      106        25 (24%)
  W/o coref., w restriction      32         18 (56%)
  W coref. and restriction       37         23 (62%)
  Manual (in 5 hours)            (100)      (100)

Table 1: Results in the murder cases domain


Sample 1:

•  PERSON1 killed PERSON2.

•  PERSON1 let PERSON2 die from loss of blood.

Sample 2:

•  PERSON1 shadowed PERSON2.

•  PERSON1 kept his eyes on PERSON2.


Figure 3: Sample correct paraphrases obtained
(translated from Japanese)


Sample 3:

•  PERSON1 fled to LOCATION.

•  PERSON1 fled and lay in ambush to LOCATION.

Sample 4:

•  PERSON1 cohabited with PERSON2.

•  PERSON1 murdered in the room for cohabitation
with PERSON2.


Figure 4: Sample incorrect paraphrases obtained
(translated from Japanese)


Next we applied a simple vector space method to obtain pairs of
sentences which report the same event. After that, we used a simple
coreference resolver to identify anchors. Finally we used a dependency
analyzer (Kurohashi, 1998) to extract portions of sentences which
share at least one anchor.

In this experiment, we used a set of articles which report murder
cases. The results are shown in Table 1. First, with Topic Detection
and Tracking, there were 156 correct pairs of articles out of 193
pairs obtained. To simplify the evaluation process, we actually
obtained paraphrases from the top 20 pairs of articles which had the
highest similarities. Obtained paraphrases were reviewed manually. We
used the following criteria for judging the correctness of
paraphrases:

1.  They have to describe the same event.

2.  They should capture the same information if we use them in an
actual IE application.

We tried several conditions to extract paraphrases. First we tried to
extract paraphrases using neither coreference resolution nor case
restriction. Then we applied only the case restriction with the
threshold $0.3 < S_p(C)$, and observed that the precision went up from
24% to 56%. Furthermore, we added a simple coreference resolution and
the precision rose to 62%. We got 23 correct paraphrases. We found
that several interesting paraphrases were obtained. Some examples are
shown in Figure 3 (correct paraphrases) and Figure 4 (incorrect
paraphrases).

It is hard to say how many paraphrases can be ultimately obtained from
these articles. However, it is worth noting that after spending about
5 hours on this corpus we obtained 100 paraphrases manually.

5  Discussion 

Some paraphrases were incorrectly obtained. There were two major
causes. The first one was dependency analysis errors. Since our method
recognizes boundaries of expressions using dependency trees, if some
predicates in a tree take extra arguments, this may result in
including extraneous portions of the sentence in the paraphrase. For
example, the predicate “lay in ambush” in Sample 3 should have taken a
different noun as its subject. If so, the predicate doesn’t share the
anchors any more and could be eliminated.

The second cause was the failure to recognize contexts. In Sample 4,
we observed that even if two expressions share multiple anchors, an
obtained pair can still be incorrect. We hope that this kind of error
can be reduced by considering the contexts around expressions more
extensively.

6  Future  Work 

We hope to apply our approach further to obtain more varied
paraphrases. After a certain number of paraphrases are obtained, we
can use the obtained paraphrases as anchors to obtain additional
paraphrases. For example, if we know “A dismantle B” and “A destroy B”
are paraphrases, we could apply them to “U.N. reported Iraq
dismantling more missiles” and “U.N. official says Iraq destroyed more
Al-Samoud 2 missiles”, and obtain another pair of paraphrases “X
reports Y” and “X says Y”.

This approach can be extended in the other direction. Some entities
can be referred to by completely different names in certain
situations, such as “North Korea” and “Pyongyang”. We are also
planning to identify these varied external forms of a single entity by
applying previously obtained paraphrases. For example, if we know “A
restarted B” and “A reactivated B” as paraphrases, we could apply them
to “North Korea restarted its nuclear facility” and “Pyongyang has
reactivated the atomic facility”. This way we know “North Korea” and
“Pyongyang” can refer to the same entity in a certain context.

In addition, we are planning to give some credibility score to anchors
for improving accuracy. We found that some anchors are less reliable
than others even if they are considered as proper expressions. For
example, in most U.S. newspapers the word “U.S.” is used in much wider
contexts than a word such as “Thailand” although both of them are
country names. So we want to give less credit to these widely used
names.

We noticed that there are several issues in generalizing paraphrases.
Currently we simply label every Named Entity as a slot. However,
expressions such as “the governor of LOCATION” can take only a certain
kind of location. Also some paraphrases might require a narrower
context than others and are not truly interchangeable. For example,
“PERSON was sworn” can be replaced with “PERSON took office”, but not
vice versa.

7  Conclusions 

In this paper, we described a method to obtain paraphrases
automatically from corpora. Our key notion is to use comparable
articles which report the same event on the same day. Some noun
phrases, especially Extended Named Entities such as names, locations
and numbers, are preserved across articles even if the event is
reported using different expressions. We used these noun phrases as
anchors and extracted portions which share these anchors. Then we
generalized the obtained expressions as usable paraphrases.

We adopted dependency trees as a format for expressions which preserve
syntactic constraints when extracting paraphrases. We generate
possible subtrees from dependency trees and find pairs which share the
anchors. However, simply generating all subtrees ends up obtaining
many inappropriate portions of sentences. We tackled this problem by
calculating a score which tells us how plausible extracted candidates
are. We confirmed that it contributed to the overall accuracy. This
metric was also useful for trimming the search space for matching
subtrees. We used a simple coreference resolver to handle some
additional anchors such as pronouns.